1 Climate Change and Temperature Anomalies

To study climate change, we used data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere from NASA’s Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here.

To define temperature anomalies we need a reference (base) period, which NASA specifies as 1951-1980.
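As a minimal sketch of what an anomaly means (using made-up temperatures, not the NASA data): each delta is an observed temperature minus the corresponding base-period average.

```r
# Hypothetical January temperatures (degrees C) over a base period
base_period_jan <- c(2.1, 1.8, 2.4, 2.0, 1.9)

# A hypothetical observed January temperature in a later year
observed_jan <- 3.1

# The anomaly (delta) is the departure from the base-period mean
delta <- observed_jan - mean(base_period_jan)
delta  # 1.06
```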

The code to load the data (provided by Prof. Kostis) is:

weather <- 
  read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv", 
           skip = 1, 
           na = "***")

The two objectives in this section are to:

  1. Select the year and the twelve month variables from the weather dataset; we do not need the others (J-D, D-N, DJF, etc.) for this assignment.

  2. Convert the dataframe from wide to long format and use the new dataframe, named tidyweather, to analyse the anomalies, which are assigned to the variable delta.

weather_update <- weather %>% 
  select(1:13)

head(weather_update)
## # A tibble: 6 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  1880 -0.34 -0.5  -0.22 -0.29 -0.05 -0.15 -0.17 -0.25 -0.22 -0.31 -0.42 -0.39
## 2  1881 -0.3  -0.21 -0.03  0.01  0.04 -0.32  0.09 -0.03 -0.25 -0.42 -0.36 -0.22
## 3  1882  0.27  0.22  0.02 -0.31 -0.24 -0.29 -0.27 -0.14 -0.24 -0.52 -0.32 -0.68
## 4  1883 -0.57 -0.65 -0.15 -0.3  -0.25 -0.11 -0.05 -0.22 -0.33 -0.16 -0.44 -0.14
## 5  1884 -0.16 -0.1  -0.64 -0.59 -0.35 -0.41 -0.44 -0.5  -0.44 -0.44 -0.57 -0.46
## 6  1885 -1    -0.44 -0.23 -0.48 -0.58 -0.44 -0.34 -0.41 -0.4  -0.37 -0.38 -0.11
tidyweather <- weather_update %>% 
  pivot_longer(cols=2:13, 
               names_to="Month", 
               values_to = "delta")
head(tidyweather)
## # A tibble: 6 × 3
##    Year Month delta
##   <dbl> <chr> <dbl>
## 1  1880 Jan   -0.34
## 2  1880 Feb   -0.5 
## 3  1880 Mar   -0.22
## 4  1880 Apr   -0.29
## 5  1880 May   -0.05
## 6  1880 Jun   -0.15

Thus the dataframe now has three variables, for:

  1. year,
  2. month, and
  3. delta, or temperature deviation.

1.1 Plotting Information

To plot the data in a time-series scatter plot and add a trendline, we first need to create a new variable called date, so that the delta values are plotted chronologically.

tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), Month, "1")),
         month = month(date, label=TRUE),
         year = year(date))

ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point()+
  geom_smooth(color="red") +
  theme_bw() +
  labs (
    title = "Weather Anomalies"
  )

The plot suggests that the temperature increase is more pronounced in some months than others, notably February, March, November and December. These months reach into the 1.5-2 delta range, whereas most months peak in the 1-1.5 range, indicating that their temperatures were especially anomalous.

To produce more useful plots, we will group the years into intervals and filter out the years prior to 1881.

comparison <- tidyweather %>% 
  filter(Year>= 1881) %>%     #remove years prior to 1881
  #create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(
    Year %in% c(1881:1920) ~ "1881-1920",
    Year %in% c(1921:1950) ~ "1921-1950",
    Year %in% c(1951:1980) ~ "1951-1980",
    Year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"
  ))

head(comparison)
## # A tibble: 6 × 7
##    Year Month delta date       month  year interval 
##   <dbl> <chr> <dbl> <date>     <ord> <dbl> <chr>    
## 1  1881 Jan   -0.3  1881-01-01 Jan    1881 1881-1920
## 2  1881 Feb   -0.21 1881-02-01 Feb    1881 1881-1920
## 3  1881 Mar   -0.03 1881-03-01 Mar    1881 1881-1920
## 4  1881 Apr    0.01 1881-04-01 Apr    1881 1881-1920
## 5  1881 May    0.04 1881-05-01 May    1881 1881-1920
## 6  1881 Jun   -0.32 1881-06-01 Jun    1881 1881-1920

We can now employ this to create a density plot:

ggplot(comparison, aes(x=delta, fill=interval))+
  geom_density(alpha=0.2) +   #density plot with transparency set to 20%
  theme_bw() +                #theme
  labs (
    title = "Density Plot for Monthly Temperature Anomalies",
    y     = "Density"         #changing y-axis label to sentence case
  )

Continuing the analysis, we will look at annual anomalies in the data, using the following code:

#creating yearly averages
average_annual_anomaly <- tidyweather %>% 
  group_by(Year) %>%   #grouping data by Year
  
  # creating summaries for mean delta 
  # use `na.rm=TRUE` to eliminate NA (not available) values 
  summarise(annual_average_delta= mean(delta, na.rm=TRUE )) 

head(average_annual_anomaly)
## # A tibble: 6 × 2
##    Year annual_average_delta
##   <dbl>                <dbl>
## 1  1880               -0.276
## 2  1881               -0.167
## 3  1882               -0.208
## 4  1883               -0.281
## 5  1884               -0.425
## 6  1885               -0.432
#plotting the data:
ggplot(average_annual_anomaly, aes(x=Year, y= annual_average_delta))+
  geom_point()+
  
  #Fit the best fit line, using LOESS method
  geom_smooth() +
  
  #change to theme_bw() to have white background + black frame around plot
  theme_bw() +
  labs (
    title = "Average Yearly Anomaly",
    y     = "Average Annual Delta"
  )                         

1.2 Confidence Interval for delta

NASA points out on their website that

A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.

We will thus construct a confidence interval (using both the formula and bootstrapping) for the average annual delta from 2011:

formula_ci <- comparison %>% 

# choose the interval 2011-present
  filter(interval == "2011-present") %>% 
  summarise(mean_delta = mean(delta, na.rm = TRUE),
            median_delta = median(delta, na.rm = TRUE),
            sd_delta = sd(delta, na.rm = TRUE),
            count = n(),
            t_critical = qt(0.975, count - 1),
            se_delta = sd_delta/sqrt(count),
            margin_of_error = t_critical * se_delta,
            delta_low = mean_delta - margin_of_error,
            delta_high = mean_delta + margin_of_error)

  
formula_ci
## # A tibble: 1 × 9
##   mean_delta median_delta sd_delta count t_critical se_delta margin_of_error
##        <dbl>        <dbl>    <dbl> <int>      <dbl>    <dbl>           <dbl>
## 1       1.06         1.04    0.274   132       1.98   0.0239          0.0473
## # … with 2 more variables: delta_low <dbl>, delta_high <dbl>
library(infer)
set.seed(1234)

boot_delta <- comparison %>%
  filter(interval == "2011-present") %>% 
  specify(response = delta) %>% 
  generate(reps = 1000, type = "bootstrap")%>%
  calculate(stat = "mean")

percentile_ci <- boot_delta %>% 
  get_confidence_interval (level = 0.95, type = "percentile")

percentile_ci
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.02     1.11

The data show that the average delta for the 2011-present interval is roughly 1.06. To obtain this, we took the original dataframe, filtered for the 2011-present interval, and used summarise() to compute the mean, median, and the other statistics above. We then cross-checked this number with a bootstrap: after loading the infer package and calling set.seed() so the random draws are reproducible, we again filtered for 2011-present, specified delta as the response variable, generated 1,000 bootstrap samples, and calculated the mean of each. The resulting 95% percentile confidence interval (1.02 to 1.11) is consistent with the formula-based interval, so we are satisfied that the estimate is robust.
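As a quick sanity check (not part of the assignment code), the formula-based interval can be reproduced by hand from the summary statistics printed above (mean ≈ 1.0596, sd = 0.274, n = 132):

```r
# Summary statistics taken from the formula_ci output above
mean_delta <- 1.0596
sd_delta   <- 0.274
n          <- 132

se     <- sd_delta / sqrt(n)     # standard error of the mean
t_crit <- qt(0.975, df = n - 1)  # ~1.98 for 131 degrees of freedom

lower <- mean_delta - t_crit * se
upper <- mean_delta + t_crit * se
c(lower = lower, upper = upper)  # roughly (1.01, 1.11), close to the bootstrap CI
```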

2 Global warming and political views (GSS)

A 2010 Pew Research poll asked 1,306 Americans, “From what you’ve read and heard, is there solid evidence that the average temperature on earth has been getting warmer over the past few decades, or not?”

In this section we analyse whether there are any differences between the proportion of people who believe the earth is getting warmer and their political ideology. As usual, from the survey sample data, we will use the proportions to estimate values of population parameters. The file has 2253 observations on the following 2 variables:

  • party_or_ideology: a factor (categorical) variable with levels Conservative Republican, Liberal Democrat, Mod/Cons Democrat, Mod/Lib Republican
  • response: whether the respondent believes the earth is warming or not, or Don’t know/ refuse to answer

global_warming_pew <- read_csv(here::here("data", "global_warming_pew.csv"))

You will also notice that many responses should not be taken into consideration, like “No Answer”, “Don’t Know”, “Not applicable”, “Refused to Answer”.

global_warming_pew %>% 
  count(party_or_ideology, response)
## # A tibble: 12 × 3
##    party_or_ideology       response                          n
##    <chr>                   <chr>                         <int>
##  1 Conservative Republican Don't know / refuse to answer    45
##  2 Conservative Republican Earth is warming                248
##  3 Conservative Republican Not warming                     450
##  4 Liberal Democrat        Don't know / refuse to answer    23
##  5 Liberal Democrat        Earth is warming                405
##  6 Liberal Democrat        Not warming                      23
##  7 Mod/Cons Democrat       Don't know / refuse to answer    45
##  8 Mod/Cons Democrat       Earth is warming                563
##  9 Mod/Cons Democrat       Not warming                     158
## 10 Mod/Lib Republican      Don't know / refuse to answer    23
## 11 Mod/Lib Republican      Earth is warming                135
## 12 Mod/Lib Republican      Not warming                     135

We will construct 95% confidence intervals to estimate the population proportion who believe that the Earth is warming within each party/ideology group, excluding the Don’t know / refuse to answer respondents.

updated_global_warming_pew <- global_warming_pew %>% 
  count(party_or_ideology, response) %>% 
  filter(response != "Don't know / refuse to answer") %>% 
  pivot_wider(names_from = response,
                values_from = n ) %>% 
  janitor::clean_names() %>% 
  mutate( total = earth_is_warming + not_warming ,
          proportion= earth_is_warming / total,
          se = sqrt(proportion * (1-proportion)/ total),
          lower = proportion - 1.96*se,
          upper = proportion + 1.96*se
  )

#updated_global_warming_pew 
print.data.frame(updated_global_warming_pew )
##         party_or_ideology earth_is_warming not_warming total proportion     se
## 1 Conservative Republican              248         450   698      0.355 0.0181
## 2        Liberal Democrat              405          23   428      0.946 0.0109
## 3       Mod/Cons Democrat              563         158   721      0.781 0.0154
## 4      Mod/Lib Republican              135         135   270      0.500 0.0304
##   lower upper
## 1 0.320 0.391
## 2 0.925 0.968
## 3 0.751 0.811
## 4 0.440 0.560

In line with The challenging politics of climate change, a respondent’s ideology is not independent of their beliefs about the earth warming. Republicans (both Conservative and Mod/Lib) are less likely to believe that the earth is warming, with Conservative Republicans the least likely of all: only 35.5% of this group consider global warming a real issue.

Democrats, by contrast, are more likely to believe that the earth is warming, with Liberal Democrats showing the highest percentage (94.6%).
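To back the independence claim with a formal test (an extra check, not part of the original assignment code), we can run a chi-square test of independence on the belief counts from the table above, excluding the “Don’t know / refuse to answer” responses:

```r
# Counts of "Earth is warming" / "Not warming" by ideology, from the table above
beliefs <- matrix(c(248, 450,   # Conservative Republican
                    405,  23,   # Liberal Democrat
                    563, 158,   # Mod/Cons Democrat
                    135, 135),  # Mod/Lib Republican
                  ncol = 2, byrow = TRUE,
                  dimnames = list(c("Cons Rep", "Lib Dem",
                                    "Mod/Cons Dem", "Mod/Lib Rep"),
                                  c("warming", "not_warming")))

res <- chisq.test(beliefs)
res  # tiny p-value: belief and ideology are not independent
```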

3 Biden’s Approval Margins

As we saw in class, fivethirtyeight.com has detailed data on all polls that track the president’s approval.

# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv') 

glimpse(approval_polllist)
## Rows: 1,819
## Columns: 22
## $ president           <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup            <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate           <chr> "10/1/2021", "10/1/2021", "10/1/2021", "10/1/2021"…
## $ startdate           <chr> "1/19/2021", "1/19/2021", "1/20/2021", "1/20/2021"…
## $ enddate             <chr> "1/21/2021", "1/21/2021", "1/21/2021", "1/21/2021"…
## $ pollster            <chr> "Rasmussen Reports/Pulse Opinion Research", "Morni…
## $ grade               <chr> "B", "B", "B+", "B", "B-", "B", "B+", "B-", "B", "…
## $ samplesize          <dbl> 1500, 15000, 1516, 1993, 1115, 15000, 941, 1200, 1…
## $ population          <chr> "lv", "a", "a", "rv", "a", "a", "rv", "rv", "lv", …
## $ weight              <dbl> 0.3382, 0.2594, 1.2454, 0.0930, 1.1014, 0.2333, 1.…
## $ influence           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve             <dbl> 48, 50, 45, 56, 55, 51, 63, 58, 48, 52, 53, 55, 48…
## $ disapprove          <dbl> 45, 28, 28, 31, 32, 28, 37, 32, 47, 29, 29, 33, 47…
## $ adjusted_approve    <dbl> 50.5, 48.6, 46.5, 54.6, 53.9, 49.6, 58.7, 56.9, 50…
## $ adjusted_disapprove <dbl> 38.8, 31.3, 28.4, 34.3, 32.9, 31.3, 38.0, 33.1, 40…
## $ multiversions       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tracking            <lgl> TRUE, TRUE, NA, NA, NA, TRUE, NA, NA, TRUE, TRUE, …
## $ url                 <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id             <dbl> 74247, 74272, 74327, 74246, 74248, 74273, 74256, 7…
## $ question_id         <dbl> 139395, 139491, 139570, 139394, 139404, 139492, 13…
## $ createddate         <chr> "1/22/2021", "1/28/2021", "2/2/2021", "1/22/2021",…
## $ timestamp           <chr> "10:23:09  1 Oct 2021", "10:23:09  1 Oct 2021", "1…
# Use `lubridate` to fix dates, as they are given as characters.

library(lubridate)

updated_approval_polllist <- approval_polllist %>% 
  mutate(modeldate = mdy(modeldate),
         startdate = mdy(startdate),
         enddate = mdy(enddate),
         createddate = mdy(createddate),
         week = isoweek(enddate))

updated_approval_polllist
## # A tibble: 1,819 × 23
##    president subgroup modeldate  startdate  enddate    pollster grade samplesize
##    <chr>     <chr>    <date>     <date>     <date>     <chr>    <chr>      <dbl>
##  1 Joseph R… All pol… 2021-10-01 2021-01-19 2021-01-21 Rasmuss… B           1500
##  2 Joseph R… All pol… 2021-10-01 2021-01-19 2021-01-21 Morning… B          15000
##  3 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-21 YouGov   B+          1516
##  4 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-21 Morning… B           1993
##  5 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-21 Ipsos    B-          1115
##  6 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-22 Morning… B          15000
##  7 Joseph R… All pol… 2021-10-01 2021-01-21 2021-01-22 HarrisX  B+           941
##  8 Joseph R… All pol… 2021-10-01 2021-01-21 2021-01-23 RMG Res… B-          1200
##  9 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-24 Rasmuss… B           1500
## 10 Joseph R… All pol… 2021-10-01 2021-01-21 2021-01-23 Morning… B          15000
## # … with 1,809 more rows, and 15 more variables: population <chr>,
## #   weight <dbl>, influence <dbl>, approve <dbl>, disapprove <dbl>,
## #   adjusted_approve <dbl>, adjusted_disapprove <dbl>, multiversions <chr>,
## #   tracking <lgl>, url <chr>, poll_id <dbl>, question_id <dbl>,
## #   createddate <date>, timestamp <chr>, week <dbl>

3.1 Create a Plot (Biden’s Net Approval Ratings)

We will now create a plot of Biden’s net approval ratings since entering office, using the following plot as a reference:

new_approval_polllist <- updated_approval_polllist  %>% 
  mutate(net_approval_rate = approve - disapprove)  %>% 
  group_by(week)  %>% 
  summarise(average_net_approval_rate = mean(net_approval_rate),
            median_rate = median(net_approval_rate),
            sd_rate = sd(net_approval_rate),
            count = n(),
            t_critical = qt(0.975, count - 1),
            se_rate = sd_rate/sqrt(count),
            margin_of_error = t_critical * se_rate,
            CI_rate_low = average_net_approval_rate - margin_of_error,
            CI_rate_high = average_net_approval_rate + margin_of_error) 

head(new_approval_polllist)
## # A tibble: 6 × 10
##    week average_net_approval_rate median_rate sd_rate count t_critical se_rate
##   <dbl>                     <dbl>       <dbl>   <dbl> <int>      <dbl>   <dbl>
## 1     3                      20.2        23      7.90    28       2.05    1.49
## 2     4                      18.1        22      8.58    51       2.01    1.20
## 3     5                      16.2        20      7.82    49       2.01    1.12
## 4     6                      16.7        20.5    7.43    40       2.02    1.18
## 5     7                      16.1        19      7.48    52       2.01    1.04
## 6     8                      13.8        15.5    7.46    54       2.01    1.02
## # … with 3 more variables: margin_of_error <dbl>, CI_rate_low <dbl>,
## #   CI_rate_high <dbl>
ggplot(new_approval_polllist, aes(x = week, y = average_net_approval_rate)) +
  geom_point(color = "orange", size = 0.6) +
  geom_smooth(color = "blue", size = 0.6, se = FALSE) +
  geom_line(color = "orange", size = 0.3) +
  geom_line(aes(y = CI_rate_high), color = "orange", size = 0.3) +
  geom_line(aes(y = CI_rate_low), color = "orange", size = 0.3) +
  geom_hline(yintercept = 0, color = "orange", size = 1) +
  geom_ribbon(aes(ymin = CI_rate_low, ymax = CI_rate_high), alpha = 1/10) +
  theme_bw() +
  labs( y = "Average Approval Margin",
        x = "Week of the Year",
        title = "Estimating Approval Margin for Joe Biden",
        subtitle = "Weekly average of all polls") 

3.2 Compare Confidence Intervals

Prof. Kostis told us we could leave this question due to missing data; we instead compare the confidence intervals for two weeks that do have data, weeks 9 and 31.

week_3 <- new_approval_polllist  %>% 
  filter(week == 9)  %>% 
  select (week, CI_rate_low, CI_rate_high)

week_3
## # A tibble: 1 × 3
##    week CI_rate_low CI_rate_high
##   <dbl>       <dbl>        <dbl>
## 1     9        10.7         15.0
week_25 <- new_approval_polllist  %>% 
  filter(week == 31)  %>% 
  select (week, CI_rate_low, CI_rate_high)

week_25  
## # A tibble: 1 × 3
##    week CI_rate_low CI_rate_high
##   <dbl>       <dbl>        <dbl>
## 1    31        4.97         8.09

4 Challenge 1: Excess Rentals in TfL Bike Sharing

Recall the TfL data on how many bikes were hired every single day. We can get the latest data by running the following:

url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"

# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2021-09-23T12%3A52%3A20/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20211002%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20211002T184231Z&X-Amz-Expires=300&X-Amz-Signature=7cdc644569fed5c595cc2883782e6079e173ea3375d978ab9a55c59fff9de48d&X-Amz-SignedHeaders=host]
##   Date: 2021-10-02 18:42
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 174 kB
## <ON DISK>  /var/folders/1t/0y34vl815b9fhdq__tk33y100000gn/T//RtmpifnPBS/file1a711a0014cf.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
                   sheet = "Data",
                   range = cell_cols("A:B"))

# change dates to get year, month, and week
bike <- bike0 %>% 
  clean_names() %>% 
  rename (bikes_hired = number_of_bicycle_hires) %>% 
  mutate (year = year(day),
          month = lubridate::month(day, label = TRUE),
          week = isoweek(day))

We can easily create a facet grid that plots bikes hired by month and year.
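The plotting code itself was not echoed above; a sketch of what it might look like (using ggplot2’s facet_grid, shown here on a tiny made-up stand-in for the `bike` dataframe so the snippet is self-contained) is:

```r
library(ggplot2)

# Tiny made-up stand-in for the `bike` dataframe (year, month, bikes_hired)
set.seed(42)
bike_demo <- data.frame(
  year  = rep(c(2019, 2020), each = 60),
  month = factor(rep(rep(month.abb[5:6], each = 30), 2), levels = month.abb),
  bikes_hired = c(rnorm(60, mean = 40000, sd = 5000),    # 2019: concentrated
                  rnorm(60, mean = 40000, sd = 12000))   # 2020: more spread out
)

# Density of daily hires, facetted by year (rows) and month (columns)
p <- ggplot(bike_demo, aes(x = bikes_hired)) +
  geom_density() +
  facet_grid(rows = vars(year), cols = vars(month)) +
  theme_minimal() +
  labs(x = "Bikes hired", y = "Density")
p
```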

Considering the May and June data, the standard deviation of the 2020 data appears higher than in other years, because the 2020 curves for May and June are flatter. In every year except 2020, around 40k bikes were rented on average in both May and June; in 2020, however, the data are not concentrated around the mean. The likely reason is Covid-19: the pandemic disrupted normal rental patterns, so statistics computed for May and June 2020 may be less representative, which is why their standard deviation appears higher.

We will now move on to reproducing the following two graphs.

The second one looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).

For both of these graphs, we calculate the expected number of rentals per week or month between 2016-2019 and then see how each week/month of 2020-2021 compares to the expected rentals, using the calculation excess_rentals = actual_rentals - expected_rentals.

The mean should be used to calculate the expected rentals per month.

The number of rentals on weekdays and weekend days is likely to differ substantially, and the mean takes every value in the month into account, giving a more accurate picture of the average number of rentals in the overall month. The median, on the other hand, only takes the middle value, which is likely to fall on a weekday, since there are five weekdays and two weekend days in a week. The median therefore does not reflect average rentals per month accurately.
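A small numeric example (made-up rental counts, not the TfL data) illustrates the point: with five busy weekdays and two quieter weekend days, the median lands on a weekday value, while the mean reflects all seven days.

```r
# Hypothetical daily rentals for one week: five weekdays, then Saturday and Sunday
rentals <- c(40000, 41000, 39000, 40500, 40200,  # weekdays
             25000, 24000)                       # weekend

mean(rentals)    # ~35671: pulled down by the quieter weekend days
median(rentals)  # 40000: the middle value falls on a weekday
```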

head(bike)
## # A tibble: 6 × 5
##   day                 bikes_hired  year month  week
##   <dttm>                    <dbl> <dbl> <ord> <dbl>
## 1 2010-07-30 00:00:00        6897  2010 Jul      30
## 2 2010-07-31 00:00:00        5564  2010 Jul      30
## 3 2010-08-01 00:00:00        4303  2010 Aug      30
## 4 2010-08-02 00:00:00        6642  2010 Aug      31
## 5 2010-08-03 00:00:00        7966  2010 Aug      31
## 6 2010-08-04 00:00:00        7893  2010 Aug      31
updated_bike <- bike %>% 
  filter (year >2015) %>% 
  group_by(month, year) %>% 
  summarise (actual_mean_rental = mean(bikes_hired))

updated_bike_2016_2019 <- bike %>% 
  filter (year >2015 & year< 2020) %>% 
  group_by (month) %>% 
  summarise (expected_rental = mean(bikes_hired)) #expected rentals per month

final_bike <- left_join(updated_bike, updated_bike_2016_2019) %>% 
  mutate (excess_rentals= actual_mean_rental - expected_rental,
          up= ifelse(actual_mean_rental > expected_rental, actual_mean_rental, 
                     expected_rental),
          down= ifelse(actual_mean_rental < expected_rental, actual_mean_rental, 
                       expected_rental))
final_bike
## # A tibble: 68 × 7
## # Groups:   month [12]
##    month  year actual_mean_rental expected_rental excess_rentals     up   down
##    <ord> <dbl>              <dbl>           <dbl>          <dbl>  <dbl>  <dbl>
##  1 Jan    2016             18914.          20617.        -1703.  20617. 18914.
##  2 Jan    2017             20596.          20617.          -20.6 20617. 20596.
##  3 Jan    2018             20836.          20617.          218.  20836. 20617.
##  4 Jan    2019             22123.          20617.         1505.  22123. 20617.
##  5 Jan    2020             22893.          20617.         2276.  22893. 20617.
##  6 Jan    2021             13218.          20617.        -7399.  20617. 13218.
##  7 Feb    2016             20608.          22049.        -1441.  22049. 20608.
##  8 Feb    2017             22091.          22049.           42.1 22091. 22049.
##  9 Feb    2018             20587.          22049.        -1462.  22049. 20587.
## 10 Feb    2019             24961.          22049.         2912.  24961. 22049.
## # … with 58 more rows
ggplot(final_bike, aes(x=month, y= actual_mean_rental)) +
  geom_line(aes(group=1)) + #actual values
  geom_line(aes(y= expected_rental, group=1), size=0.8, color= "blue") + 
  #expected values
  facet_wrap(vars(year)) +
  theme_bw()+
  geom_ribbon(aes(ymin=expected_rental,ymax=up, group=1),fill="darkseagreen3",
              alpha=0.4)+
  geom_ribbon(aes(ymin=down,ymax=expected_rental, group=1),fill="indianred3",
              alpha=0.4) +
  labs(y = "Bike Rentals",
       title= "Monthly changes in TFL bike rentals",
       subtitle= "Change from monthly average shown in blue and calculated between 2016-2019")

bike_change <- bike %>% 
  filter((year >= 2016 & year <= 2020) | (year == 2021 & week <53)) %>%
  group_by(week,year) %>%
  mutate(bikes_each_week = sum(bikes_hired)) %>% #sum up no.of bikes in a week
  
  group_by(week) %>% 
  mutate(average_each_week = median(bikes_each_week), #median bikes in same week 
         excess_rentals_percentage = 
           (bikes_each_week - average_each_week)/average_each_week) %>% 
  #replace any NA percentage changes (weeks with no baseline) with 1
  replace_na(list(excess_rentals_percentage = 1))
print.data.frame(head(bike_change))
##          day bikes_hired year month week bikes_each_week average_each_week
## 1 2016-01-01        9922 2016   Jan   53           22062             59380
## 2 2016-01-02        7246 2016   Jan   53           22062             59380
## 3 2016-01-03        4894 2016   Jan   53           22062             59380
## 4 2016-01-04       20644 2016   Jan    1          132071            126680
## 5 2016-01-05       22934 2016   Jan    1          132071            126680
## 6 2016-01-06       23199 2016   Jan    1          132071            126680
##   excess_rentals_percentage
## 1                   -0.6285
## 2                   -0.6285
## 3                   -0.6285
## 4                    0.0426
## 5                    0.0426
## 6                    0.0426
ggplot(bike_change, aes(x = week, y = excess_rentals_percentage))+
  scale_x_continuous(breaks = seq(0, 53, by = 13)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  coord_cartesian(ylim = c(-1,1)) +
  
  geom_line(aes(y = excess_rentals_percentage)) +
  
  # draw each shaded quarter once with annotate(), not once per data row
  annotate("rect", xmin = 13, xmax = 27, ymin = -2, ymax = 2,
           fill = "grey", alpha = 0.3) +
  annotate("rect", xmin = 39, xmax = 53, ymin = -2, ymax = 2,
           fill = "grey", alpha = 0.3) +
  
  geom_ribbon(aes(ymin=pmin(excess_rentals_percentage,0), ymax=0), 
              fill="indianred3", alpha=0.5) +
  geom_ribbon(aes(ymin=0, ymax=pmax(excess_rentals_percentage,0)), 
              fill="darkseagreen3", alpha=0.5) +
  
  geom_rug(data = subset(bike_change, excess_rentals_percentage < 0 ), 
           color="indianred3", sides="b") +
  geom_rug(data = subset(bike_change, excess_rentals_percentage >= 0 ), 
           color="darkseagreen3", sides="b") +

  facet_wrap(~year, nrow = 2) +
  theme(legend.position="none") +
  theme_minimal() +
  labs (
    title = "Weekly changes in TfL bike rentals",
    subtitle = "% change from weekly averages calculated between 2016-2019",
    x = "week", 
    y = "",
    caption = "Source: TfL, London Data Store",
  ) +
  NULL 

5 Challenge 2: How Has the CPI and its Components Changed Over the Last Few Years?

For this challenge, we will be reproducing the following graph, but from the year 2000 onwards:

This graph has too many sub-categories, so, using the relative importance of components in the Consumer Price Indexes: U.S. city average, December 2020, we chose a smaller subset of the components and only listed the major categories (Housing, Transportation, Food and beverages, Medical care, Education and communication, Recreation, and Apparel).

url <- "https://fredaccount.stlouisfed.org/public/datalist/843"

library(rvest)

tables <- url  %>% # Reading in data from site
  read_html()  %>% 
  html_nodes(css ="table")

cpi <- map(tables, . %>% 
             html_table(fill=TRUE)  %>% 
             janitor::clean_names())  # Ensuring names are in a useable format

cpi_data <- cpi[[2]] 

references <- as.vector(pull(cpi_data, series_id))    

cpi_components <- references  %>% 
  tidyquant::tq_get(get = "economic.data", from =  "1999-01-01")  %>%
  group_by(symbol)  %>%  # lag within each series, not across series boundaries
  mutate(cpi_yoy_change = price/lag(price, 12) - 1)  %>%  # Change calculation
  ungroup()  %>% 
  na.omit()  # Get rid of any NA values

# 1999 was chosen as the year from which to pull as the lag function reads from
# 12 months prior, and so pulling from 2000 means you can only get calculable
# results from 2001 January. The NA results for 1999 are then filtered out
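To illustrate the 12-month lag calculation on its own (a made-up price index with steady 2% monthly growth, mirroring what price/lag(price, 12) - 1 does within each series):

```r
# Hypothetical monthly price index growing 2% per month for two years
price <- 100 * 1.02^(0:23)

# Year-over-year change: each month compared with the same month a year earlier
yoy <- price[13:24] / price[1:12] - 1
yoy[1]  # 1.02^12 - 1, about 0.268 (26.8% YoY)
```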

cpi_components_major <- cpi_components %>% 
  filter((symbol == "CPIAUCSL") | (symbol == "CPIHOSSL") | 
           (symbol == "CPITRNSL") | (symbol == "CPIFABSL") 
         | (symbol == "CPIAPPSL"))  %>% 
  dplyr::mutate(titles = case_when(symbol == "CPIAUCSL" ~ "All Items", 
                               symbol == "CPIHOSSL" ~ "Housing",
                               symbol == "CPITRNSL" ~ "Transport",
                               symbol == "CPIFABSL" ~ "Food and Beverages",
                               symbol == "CPIAPPSL" ~ "Apparel"))
ggplot(cpi_components_major, aes(x = date, y=cpi_yoy_change))+
  #scale_x_continuous(breaks = seq(0, 54, by = 13)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  coord_cartesian(ylim = c(-0.5, 0.5)) +
  
  geom_point(size = 0.5, aes(color=ifelse(cpi_yoy_change<0, "red4", 
                                          "steelblue4")),
             show.legend = FALSE) +
  scale_color_identity() + # use the colour names literally, not as factor levels
  geom_smooth(color = "gray") +

  facet_wrap(~titles, nrow = 2) +
  theme(legend.position = "none") +
  theme_minimal() +
  labs (
    title = "Yearly Change of US CPI and its Major Components",
    subtitle = "YOY Change Being Positive or Negative,
Jan 2000 to Aug 2021 ",
    x = "", 
    y = "YoY % Change",
    caption = "Data from FRED
https://fredaccount.stlouisfed.org/public/datalist/843",
  ) +
  NULL

6 Details

  • Who did you collaborate with: N/A
  • Approximately how much time did you spend on this problem set: 5 hours
  • What, if anything, gave you the most trouble: Challenge 2, particularly, pulling the requisite data

Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?

Yes